Results 1 - 20 of 12,398
1.
BMC Bioinformatics ; 25(1): 146, 2024 Apr 11.
Article in English | MEDLINE | ID: mdl-38600441

ABSTRACT

BACKGROUND: The advent of high-throughput technologies has led to an exponential increase in uncharacterized bacterial protein sequences, surpassing the capacity of manual curation. A large number of bacterial protein sequences remain unannotated by Kyoto Encyclopedia of Genes and Genomes (KEGG) orthology, making it necessary to use auto annotation tools. These tools are now indispensable in the biological research landscape, bridging the gap between the vastness of unannotated sequences and meaningful biological insights. RESULTS: In this work, we propose a novel pipeline for KEGG orthology annotation of bacterial protein sequences that uses natural language processing and deep learning. To assess the effectiveness of our pipeline, we conducted evaluations using the genomes of two randomly selected species from the KEGG database. In our evaluation, we obtain competitive results on precision, recall, and F1 score, with values of 0.948, 0.947, and 0.947, respectively. CONCLUSIONS: Our experimental results suggest that our pipeline demonstrates performance comparable to traditional methods and excels in identifying distant relatives with low sequence identity. This demonstrates the potential of our pipeline to significantly improve the accuracy and comprehensiveness of KEGG orthology annotation, thereby advancing our understanding of functional relationships within biological systems.
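For reference, the F1 score is the harmonic mean of precision and recall. The short sketch below recomputes it from the precision and recall reported above. The function name `f1_score` is illustrative, and the abstract does not say whether its figures are averaged per class, so an exact match with the reported 0.947 should not be expected.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract above.
print(round(f1_score(0.948, 0.947), 4))
```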


Subjects
Bacterial Proteins , Natural Language Processing , Genome , Molecular Sequence Annotation , Amino Acid Sequence
2.
BMC Bioinformatics ; 25(1): 165, 2024 Apr 25.
Article in English | MEDLINE | ID: mdl-38664627

ABSTRACT

BACKGROUND: The annotation of protein sequences in public databases has long posed a challenge in molecular biology. This issue is particularly acute for viral proteins, which demonstrate limited homology to known proteins when using alignment, k-mer, or profile-based homology search approaches. A novel methodology employing Large Language Models (LLMs) addresses this methodological challenge by annotating protein sequences based on embeddings. RESULTS: Central to our contribution is the soft alignment algorithm, drawing from traditional protein alignment but leveraging embedding similarity at the amino acid level to bypass the need for conventional scoring matrices. This method not only surpasses pooled embedding-based models in efficiency but also in interpretability, enabling users to easily trace homologous amino acids and delve deeper into the alignments. Far from being a black box, our approach provides transparent, BLAST-like alignment visualizations, combining traditional biological research with AI advancements to elevate protein annotation through embedding-based analysis while ensuring interpretability. Tests using the Virus Orthologous Groups and ViralZone protein databases indicated that the novel soft alignment approach recognized and annotated sequences that both blastp and pooling-based methods, which are commonly used for sequence annotation, failed to detect. CONCLUSION: The embeddings approach shows the great potential of LLMs for enhancing protein sequence annotation, especially in viral genomics. These findings present a promising avenue for more efficient and accurate protein function inference in molecular biology.
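The idea of scoring an alignment by per-residue embedding similarity rather than a substitution matrix can be sketched as follows. This is a minimal illustration and not the authors' implementation: it is a plain Smith-Waterman local alignment in which the substitution score is the cosine similarity of two residue embeddings minus an offset. The names `cosine` and `soft_local_align`, the `gap` and `offset` values, and the toy two-dimensional embeddings are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def soft_local_align(emb_a, emb_b, gap=0.5, offset=0.5):
    """Smith-Waterman-style local alignment scored by embedding similarity.

    emb_a, emb_b: one embedding vector per residue.
    gap, offset:  hypothetical parameters; the offset makes dissimilar
                  residue pairs score negatively.
    Returns (best score, list of aligned (i, j) residue index pairs).
    """
    n, m = len(emb_a), len(emb_b)
    score = [[cosine(a, b) - offset for b in emb_b] for a in emb_a]
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0.0,
                H[i - 1][j - 1] + score[i - 1][j - 1],  # align residues i-1 and j-1
                H[i - 1][j] - gap,                      # gap in sequence b
                H[i][j - 1] - gap,                      # gap in sequence a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the local alignment ends at a zero.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


# Toy embeddings: the last two residues of `a` resemble the two residues of `b`.
a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [[0.0, 1.0], [1.0, 1.0]]
print(soft_local_align(a, b))
```

Because the traceback yields explicit residue pairs, the result can be rendered in the familiar BLAST-like layout, which is the interpretability advantage the abstract describes over pooled, sequence-level embeddings.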


Subjects
Algorithms , Molecular Sequence Annotation , Sequence Alignment , Molecular Sequence Annotation/methods , Sequence Alignment/methods , Viral Proteins/genetics , Viral Proteins/chemistry , Genes, Viral , Databases, Protein , Computational Biology/methods , Amino Acid Sequence
3.
Bioinformatics ; 40(4), 2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38640488

ABSTRACT

MOTIVATION: The ENCODE project generated a large collection of eCLIP-seq RNA binding protein (RBP) profiling data with accompanying RNA-seq transcriptomes of shRNA knockdown of RBPs. These data could have utility in understanding the functional impact of genetic variants, however their potential has not been fully exploited. We implement INCA (Integrative annotation scores of variants for impact on RBP activities) as a multi-step genetic variant scoring approach that leverages the ENCODE RBP data together with ClinVar and integrates multiple computational approaches to aggregate evidence. RESULTS: INCA evaluates variant impacts on RBP activities by leveraging genotypic differences in cell lines used for eCLIP-seq. We show that INCA provides critical specificity, beyond generic scoring for RBP binding disruption, for candidate variants and their linkage-disequilibrium partners. As a result, it can, on average, augment scoring of 46.2% of the candidate variants beyond generic scoring for RBP binding disruption and aid in variant prioritization for follow-up analysis. AVAILABILITY AND IMPLEMENTATION: INCA is implemented in R and is available at https://github.com/keleslab/INCA.


Subjects
RNA-Binding Proteins , Humans , RNA-Binding Proteins/metabolism , RNA-Binding Proteins/genetics , Software , Genetic Variation , Computational Biology/methods , Molecular Sequence Annotation/methods
4.
BMC Genomics ; 25(1): 405, 2024 Apr 24.
Article in English | MEDLINE | ID: mdl-38658835

ABSTRACT

Graph-based pangenomes are gaining popularity over linear pangenomes because they store more comprehensive information about variation. However, traditional linear genome browsers retain their own advantages, especially the extensive resources accumulated around them over time. With the fast-growing number of individual genomes and annotations available, demand is rising for a genome browser that can visualize the annotations of many individuals together with a graph-based pangenome. Here we report PPanG, a precise pangenome browser that enables nucleotide-level comparison of individual genome annotations together with a graph-based pangenome. Nine rice genomes with annotations are provided by default as potential references, and any individual genome can be selected as the reference. The browser provides insight into genome variation at levels from base to gene, and reveals how the structure of a gene can differ between individuals. PPanG can be applied to any species with multiple individual genomes available, and it is available at https://cgm.sjtu.edu.cn/PPanG .


Subjects
Genomics , Genomics/methods , Oryza/genetics , Molecular Sequence Annotation , Genome, Plant , Genetic Variation , Software , Web Browser , Databases, Genetic , Nucleotides/genetics , Genome
5.
Microb Genom ; 10(4), 2024 Apr.
Article in English | MEDLINE | ID: mdl-38668652

ABSTRACT

Accurate annotation to single-nucleotide resolution of the transcribed regions in genomes is key to optimally analyse RNA-seq data, understand regulatory events and for the design of experiments. However, currently most genome annotations provided by GenBank generally lack information about untranslated regions. Additionally, information regarding genomic locations of non-coding RNAs, such as sRNAs, or anti-sense RNAs is frequently missing. To provide such information, diverse RNA-seq technologies, such as Rend-seq, have been developed and applied to many bacterial species. However, incorporating this vast amount of information into annotation files has been limited and is bioinformatically challenging, resulting in UTRs and other non-coding elements being overlooked or misrepresented. To overcome this problem, we present pyRAP (python Rend-seq Annotation Pipeline), a software package that analyses Rend-seq datasets to accurately resolve transcript boundaries genome-wide. We report the use of pyRAP to find novel transcripts, transcript isoforms, and RNase-dependent sRNA processing events. In Bacillus subtilis we uncovered 63 novel transcripts and provide genomic coordinates with single-nucleotide resolution for 2218 5'UTRs, 1864 3'UTRs and 161 non-coding RNAs. In Escherichia coli, we report 117 novel transcripts, 2429 5'UTRs, 1619 3'UTRs and 91 non-coding RNAs, and in Staphylococcus aureus, 16 novel transcripts, 664 5'UTRs, 696 3'UTRs, and 81 non-coding RNAs. Finally, we use pyRAP to produce updated annotation files for B. subtilis 168, E. coli K-12 MG1655, and S. aureus 8325 for use in the wider microbial genomics research community.


Subjects
Bacillus subtilis , Genome, Bacterial , Molecular Sequence Annotation , Software , Bacillus subtilis/genetics , Escherichia coli/genetics , RNA, Bacterial/genetics , Staphylococcus aureus/genetics , Computational Biology/methods , Sequence Analysis, RNA/methods , RNA-Seq/methods
6.
Gigascience ; 13, 2024 Jan 02.
Article in English | MEDLINE | ID: mdl-38626724

ABSTRACT

BACKGROUND: The accurate identification of the functional elements in the bovine genome is a fundamental requirement for high-quality analysis of data informing both genome biology and genomic selection. Functional annotation of the bovine genome was performed to identify a more complete catalog of transcript isoforms across bovine tissues. RESULTS: A total of 160,820 unique transcripts (50% protein coding) representing 34,882 unique genes (60% protein coding) were identified across tissues. Among them, 118,563 transcripts (73% of the total) were structurally validated by independent datasets (PacBio isoform sequencing data, Oxford Nanopore Technologies sequencing data, de novo assembled transcripts from RNA sequencing data) and comparison with Ensembl and NCBI gene sets. In addition, all transcripts were supported by extensive data from different technologies such as whole transcriptome termini site sequencing, RNA Annotation and Mapping of Promoters for the Analysis of Gene Expression, chromatin immunoprecipitation sequencing, and assay for transposase-accessible chromatin using sequencing. A large proportion of identified transcripts (69%) were unannotated, of which 86% were produced by annotated genes and 14% by unannotated genes. A median of two 5' untranslated regions were expressed per gene. Around 50% of protein-coding genes in each tissue were bifunctional and transcribed both coding and noncoding isoforms. Furthermore, we identified 3,744 genes that functioned as noncoding genes in fetal tissues but as protein-coding genes in adult tissues. Our new bovine genome annotation extended more than 11,000 annotated gene borders compared to Ensembl or NCBI annotations. The resulting bovine transcriptome was integrated with publicly available quantitative trait loci data to study tissue-tissue interconnection involved in different traits and construct the first bovine trait similarity network. 
CONCLUSIONS: These validated results show significant improvement over current bovine genome annotations.


Subjects
Gene Expression Profiling , Genomics , Cattle/genetics , Animals , Sequence Analysis, RNA , Transcriptome , Quantitative Trait Loci , RNA , Protein Isoforms , Molecular Sequence Annotation
7.
Sci Data ; 11(1): 351, 2024 Apr 08.
Article in English | MEDLINE | ID: mdl-38589366

ABSTRACT

Acanthacorydalis orientalis (McLachlan, 1899) (Megaloptera: Corydalidae) is an important freshwater benthic invertebrate from East Asia that serves as an indicator species for water-quality biomonitoring and is of conservation value. Here, a high-quality reference genome for A. orientalis was constructed using Oxford Nanopore sequencing and high-throughput chromosome conformation capture (Hi-C) technology. The final genome size is 547.98 Mb, with contig and scaffold N50 values of 7.77 Mb and 50.53 Mb, respectively. The longest contig and scaffold are 20.57 Mb and 62.26 Mb in length, respectively. In total, 99.75% of contigs were anchored onto 13 pseudo-chromosomes. Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis showed that the completeness of the genome assembly is 99.01%. A total of 10,977 protein-coding genes were identified, of which 84.00% are functionally annotated. The genome contains 44.86% repeat sequences. This high-quality genome provides substantial data for future studies on population genetics, aquatic adaptation, and evolution of Megaloptera and other related insect groups.
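Several entries in this listing report contig or scaffold N50 values. As a reminder of what that statistic measures, the sketch below computes N50 under its usual definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The function name `n50` and the toy lengths are illustrative and are not taken from any of the assemblies described here.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("n50() requires at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy assembly of five contigs (arbitrary units).
print(n50([2, 3, 4, 5, 6]))
```

In the toy case the total length is 20; the two longest contigs (6 and 5) already cover 11, so the N50 is 5.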


Subjects
Genome, Insect , Neoptera , Repetitive Sequences, Nucleic Acid , Chromosomes/genetics , Molecular Sequence Annotation , Phylogeny , Neoptera/genetics
8.
BMC Genomics ; 25(1): 346, 2024 Apr 05.
Article in English | MEDLINE | ID: mdl-38580907

ABSTRACT

BACKGROUND: The yak (Bos grunniens) is a large ruminant species that lives in high-altitude regions and exhibits excellent adaptation to plateau environments. To further understand the genetic characteristics and adaptive mechanisms of yak, we have developed a multi-omics database of yak including genome, transcriptome, proteome, and DNA methylation data. DESCRIPTION: The Yak Genome Database ( http://yakgenomics.com/ ) integrates research results on the genome, transcriptome, proteome, and DNA methylation, and provides an integrated platform for researchers to share and exchange omics data. The database contains 26,518 genes, 62 transcriptomes, 144,309 proteome spectra, and 22,478 methylation sites of yak. The genome module provides access to yak genome sequences, gene annotations and variant information. The transcriptome module offers transcriptome data from various tissues of yak and cattle strains at different developmental stages. The proteome module presents protein profiles from diverse yak organs. Additionally, the DNA methylation module shows DNA methylation information at each base of the whole genome. The database supports data download and browsing, functional gene exploration, and experimental practice. CONCLUSION: This comprehensive database provides a valuable resource for further investigations on development, molecular mechanisms underlying high-altitude adaptation, and molecular breeding of yak.


Subjects
Multiomics , Proteome , Animals , Cattle/genetics , Proteome/genetics , Genome , Transcriptome , Molecular Sequence Annotation
9.
Brief Bioinform ; 25(3), 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38581418

ABSTRACT

Following the milestone success of the Human Genome Project, the 'Encyclopedia of DNA Elements (ENCODE)' initiative was launched in 2003 to uncover information about the numerous functional elements within the genome. This endeavor coincided with the emergence of numerous novel technologies and the availability of vast numbers of whole-genome sequences and high-throughput data such as ChIP-Seq and RNA-Seq. Extracting biologically meaningful information from these massive datasets has become a critical aspect of many recent studies, particularly in annotating and predicting the functions of unknown genes. The core idea behind genome annotation is to identify genes and various functional elements within the genome sequence and infer their biological functions. Traditional wet-lab experimental methods still require extensive effort for functional verification, while early bioinformatics algorithms and software primarily employed shallow learning techniques, so their capacity for data characterization and feature learning was limited. With the widespread adoption of RNA-Seq technology, scientists in the biological community began to harness machine learning and deep learning approaches for gene structure prediction and functional annotation. In this context, we reviewed both conventional methods and contemporary deep learning frameworks, and highlighted novel perspectives on the challenges arising during annotation, underscoring the dynamic nature of this evolving scientific landscape.


Subjects
Deep Learning , Humans , Genome , Algorithms , Software , Computational Biology/methods , Molecular Sequence Annotation
10.
Mol Biol Evol ; 41(4), 2024 Apr 02.
Article in English | MEDLINE | ID: mdl-38577785

ABSTRACT

Transposable elements (TEs) are major components of eukaryotic genomes and are implicated in a range of evolutionary processes. Yet, TE annotation and characterization remain challenging, particularly for nonspecialists, since existing pipelines are typically complicated to install, run, and extract data from. Current methods of automated TE annotation are also subject to issues that reduce overall quality, particularly (i) fragmented and overlapping TE annotations, leading to erroneous estimates of TE count and coverage, and (ii) repeat models represented by short sections of total TE length, with poor capture of 5' and 3' ends. To address these issues, we present Earl Grey, a fully automated TE annotation pipeline designed for user-friendly curation and annotation of TEs in eukaryotic genome assemblies. Using nine simulated genomes and an annotation of Drosophila melanogaster, we show that Earl Grey outperforms current widely used TE annotation methodologies in ameliorating the issues mentioned above while scoring highly in benchmarking for TE annotation and classification and being robust across genomic contexts. Earl Grey provides a comprehensive and fully automated TE annotation toolkit that provides researchers with paper-ready summary figures and outputs in standard formats compatible with other bioinformatics tools. Earl Grey has a modular format, with great scope for the inclusion of additional modules focused on further quality control and tailored analyses in future releases.
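To see why overlapping annotations inflate coverage estimates (issue (i) above), consider the minimal sketch below. It is not Earl Grey's procedure, only a generic interval merge: overlapping or adjacent hits on one sequence are collapsed before bases are counted. The function names and the toy coordinates (half-open, 0-based) are assumptions made for the example.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def covered_bases(intervals):
    """Number of bases covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


# Toy TE hits on a single contig; the first three overlap or abut.
hits = [(0, 100), (50, 150), (150, 200), (300, 400)]
naive = sum(end - start for start, end in hits)
print(naive, covered_bases(hits), merge_intervals(hits))
```

Summing the raw hit lengths counts the overlapping region twice, so the naive total exceeds the number of bases actually annotated; the merged intervals also give a lower, defragmented element count.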


Subjects
DNA Transposable Elements , Drosophila melanogaster , Animals , DNA Transposable Elements/genetics , Molecular Sequence Annotation , Drosophila melanogaster/genetics , Genomics/methods , Computational Biology
11.
Sci Data ; 11(1): 340, 2024 Apr 05.
Article in English | MEDLINE | ID: mdl-38580722

ABSTRACT

Despite the rapid advances in sequencing technology, limited genomic resources are currently available for phytophagous spider mites, which include many important agricultural pests. One of these pests is Tetranychus piercei (McGregor), a serious banana pest in East Asia exhibiting remarkable tolerance to high temperature. In this study, we assembled a high-quality genome of T. piercei using a combination of PacBio long reads and Illumina short reads sequencing. With the assistance of chromatin conformation capture technology, 99.9% of the contigs were anchored into three pseudochromosomes with a total size of 86.02 Mb. Repetitive elements, accounting for 14.16% of this genome (12.20 Mb), are predominantly composed of long-terminal repeats (30.7%). By combining evidence of ab initio prediction, transcripts, and homologous proteins, we annotated 11,881 protein-coding genes. Both the genome and proteins have high BUSCO completeness scores (>94%). This high-quality genome, along with reliable annotation, provides a valuable resource for investigating the high-temperature tolerance of this species and exploring the genomic basis that underlies the host range evolution of spider mites.


Subjects
Tetranychidae , Animals , Chromosomes , Genome , Genomics , Molecular Sequence Annotation , Phylogeny , Repetitive Sequences, Nucleic Acid , Tetranychidae/genetics
12.
Genome Res ; 34(3): 469-483, 2024 Apr 25.
Article in English | MEDLINE | ID: mdl-38514204

ABSTRACT

With the goal of mapping genomic activity, international projects have recently measured epigenetic activity in hundreds of cell and tissue types. Chromatin state annotations produced by segmentation and genome annotation (SAGA) methods have emerged as the predominant way to summarize these epigenomic data sets in order to annotate the genome. These chromatin state annotations are essential for many genomic tasks, including identifying active regulatory elements and interpreting disease-associated genetic variation. However, despite the widespread applications of SAGA methods, no principled approach exists to evaluate the statistical significance of chromatin state assignments. Here, we propose the first method for assigning calibrated confidence scores to chromatin state annotations. Toward this goal, we performed a comprehensive evaluation of the reproducibility of the two most widely used existing SAGA methods, ChromHMM and Segway. We found that their predictions are frequently irreproducible. For example, when applying the same SAGA method on two sets of experimental replicates, 27%-69% of predicted enhancers fail to replicate. This suggests that a substantial fraction of predicted elements in existing chromatin state annotations cannot be relied upon. To remedy this problem, we introduce SAGAconf, a method for assigning a measure of confidence (r-value) to chromatin state annotations. SAGAconf works with any SAGA method and assigns an r-value to each genomic bin of a chromatin state annotation that represents the probability that the label of this bin will be reproduced in a replicated experiment. Thus, SAGAconf allows a researcher to select only the reliable predictions from a chromatin annotation for use in downstream analyses.
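The kind of replicate comparison behind a figure such as "27%-69% of predicted enhancers fail to replicate" can be illustrated with a naive per-bin check. The sketch below is not SAGAconf and does not compute its calibrated r-value; it only counts, for one label, how many bins carrying that label in the first replicate carry a different label in the second. The function name, the label strings, and the toy annotations are hypothetical.

```python
def fraction_not_replicated(rep1, rep2, label):
    """Fraction of bins labeled `label` in rep1 that differ in rep2.

    rep1, rep2: per-bin chromatin state labels over the same genomic bins.
    Returns 0.0 when rep1 contains no bin with the requested label.
    """
    if len(rep1) != len(rep2):
        raise ValueError("replicates must cover the same bins")
    bins = [i for i, state in enumerate(rep1) if state == label]
    if not bins:
        return 0.0
    missed = sum(1 for i in bins if rep2[i] != label)
    return missed / len(bins)


# Toy annotations over five genomic bins.
rep1 = ["Enh", "Enh", "Prom", "Enh", "Quies"]
rep2 = ["Enh", "Quies", "Prom", "Prom", "Quies"]
print(fraction_not_replicated(rep1, rep2, "Enh"))
```

An exact-match test like this is deliberately crude: it ignores near-misses in position and label similarity between states, which is part of why a calibrated per-bin confidence measure is useful.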


Subjects
Chromatin , Molecular Sequence Annotation , Chromatin/genetics , Chromatin/metabolism , Humans , Reproducibility of Results , Genomics/methods
13.
Curr Microbiol ; 81(5): 109, 2024 Mar 11.
Article in English | MEDLINE | ID: mdl-38466427

ABSTRACT

Bacteria producing urea amidohydrolases (UA) and carbonic anhydrases (CA) are of great importance in civil engineering because these enzymes are responsible for microbially induced calcium carbonate precipitation (MICCP). In this investigation, the genome of Bacillus paranthracis CT5 and the expression of genes underlying MICCP were studied. B. paranthracis produced maximum levels of UA (669.3 U/ml) and CA (125 U/ml) on the 5th day of incubation and precipitated 197 mg/100 ml CaCO3 after 7 days of incubation. After 28 days of curing, the compressive strength of bacterial-admixed and bacterial-cured (B-B) specimens was 13.7% higher than that of water-mixed and water-cured (W-W) specimens. A significant decrease in water absorption was observed in bacterial-cured specimens compared to water-cured specimens after 28 days of curing. For genome analysis, reads were assembled de novo, producing a 5,402,771 bp assembly with an N50 of 273,050 bp. RAST annotation detected six amidohydrolase and three carbonic anhydrase genes. Among the 5700 coding sequences found in the genome, COG annotation grouped 4360 genes into COG categories, with the highest numbers of genes assigned to transcription (435 genes) and amino acid transport and metabolism (362 genes), along with cell wall/membrane/envelope biogenesis and ion transport and metabolism. KEGG functional classification predicted 223 pathways comprising 1,960 genes; the highest numbers of genes belong to the two-component system (101 genes) and ABC transporter (98 genes) pathways, which enable bacteria to sense and respond to environmental signals and to actively transport the minerals and organic molecules required for MICCP.


Subjects
Bacillus , Biomineralization , Carbonic Anhydrases , Bacteria/metabolism , Calcium Carbonate/chemistry , Carbonic Anhydrases/genetics , Carbonic Anhydrases/metabolism , Molecular Sequence Annotation , Water/metabolism , Urease
14.
Sci Data ; 11(1): 317, 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38538602

ABSTRACT

Zacco platypus is a colorful freshwater minnow that is endemic to and widely distributed in East Asia. In this study, two adult female individuals collected from the Haihe River basin were used for karyotyping and genome sequencing, respectively. The karyotype formula of Z. platypus is 2N = 48 = 18M + 24SM/ST + 6T. We used PacBio long-read sequencing and Hi-C technology to assemble a chromosome-level genome of Z. platypus. An 814.87 Mb genome was assembled from the PacBio long reads. Subsequently, 98.64% of the assembled sequences were anchored onto 24 chromosomes based on the Hi-C data. The chromosome-level assembly contained 54 scaffolds with an N50 length of 32.32 Mb. Repeat elements accounted for 52.35% of the genome, and 24,779 protein-coding genes were predicted, of which 92.11% were functionally annotated against public databases. BUSCO analysis yielded a completeness score of 96.5%. This high-quality genome assembly provides valuable resources for future functional genomic research, comparative genomics, and evolutionary studies of the genus Zacco.


Subjects
Cyprinidae , Animals , Female , Asia, Eastern , Chromosomes/genetics , Cyprinidae/genetics , Genomics , Molecular Sequence Annotation , Phylogeny
15.
Int J Mol Sci ; 25(6), 2024 Mar 20.
Article in English | MEDLINE | ID: mdl-38542477

ABSTRACT

Based on Sima and Lu's system of the family Magnoliaceae, the genus Lirianthe Spach s. l. includes approximately 25 species, each with exceptional landscaping, horticultural, or medicinal value. Many of these plants are considered rare and are protected due to their endangered status. The limited knowledge of species within this genus and the absence of research on its chloroplast genome have greatly impeded studies on the relationship between its evolution and systematics. In this study, the chloroplast genomes of eight species from the genus Lirianthe were sequenced and analyzed, and their phylogenetic relationships with other genera of the family Magnoliaceae were also elucidated. The results showed that the chloroplast genome sizes of the eight Lirianthe species ranged from 159,548 to 159,833 bp. The genomes consisted of a large single-copy region, a small single-copy region, and a pair of inverted repeat sequences. The GC content was very similar across species. Gene annotation revealed that the chloroplast genomes contained 85 protein-coding genes, 37 tRNA genes, and 8 rRNA genes, totaling 130 genes. Codon usage analysis indicated that codon usage was highly conserved among the eight Lirianthe species. Repeat sequence analysis identified 42-49 microsatellite sequences, 16-18 tandem repeats, and 50 dispersed repeats, with microsatellite sequences being predominantly single-nucleotide repeats. DNA polymorphism analysis revealed 10 highly variable regions located in the large single-copy and small single-copy regions, among which rpl32-trnL, petA-psbJ, and trnH-psbA were the recommended candidate DNA barcodes for Lirianthe species. The inverted repeat boundary regions show little variation between species and are generally conserved. The phylogenetic analysis confirmed that the genus Lirianthe s. l. is a monophyletic taxon most closely related to the genera Talauma and Dugandiodendron in Sima and Lu's system, and revealed that the genus Lirianthe s. s. is paraphyletic and the genus Talauma s. l. polyphyletic in Xia's system, while Magnolia subsection Gwillimia is paraphyletic and subsection Blumiana polyphyletic in Figlar and Nooteboom's system. Morphological studies found noticeable differences between Lirianthe species in aspects including leaf indumentum, stipule scars, floral orientation, tepal number, tepal texture, and fruit dehiscence. In summary, this study elucidated the chloroplast genome evolution within Lirianthe and laid a foundation for further systematic and taxonomic research on this genus.


Subjects
Genome, Chloroplast , Magnolia , Phylogeny , Molecular Sequence Annotation , Plants/genetics
16.
Sci Data ; 11(1): 322, 2024 Mar 28.
Article in English | MEDLINE | ID: mdl-38548787

ABSTRACT

Oryzias sinensis, also known as Chinese medaka or Chinese ricefish, is a commonly used animal model for aquatic environmental assessment in the wild as well as gene function validation or toxicology research in the lab. Here, a high-quality chromosome-level genome assembly of O. sinensis was generated using single-tube long fragment read (stLFR) data, Nanopore long reads, and Hi-C sequencing data. The genome is 796.58 Mb, and a total of 712.17 Mb of the assembled sequences were anchored to 23 pseudo-chromosomes. A final set of 22,461 genes was annotated, with 98.67% being functionally annotated. The Benchmarking Universal Single-Copy Orthologs (BUSCO) completeness of the genome assembly and gene annotation reached 95.1% (93.3% single-copy) and 94.6% (91.7% single-copy), respectively. Furthermore, we used ATAC-seq to profile transposase-accessible chromatin and the functional enrichment of the associated genomic regions in O. sinensis. This study offers a new, improved foundation for future genomics research in Chinese medaka.


Subjects
Oryzias , Animals , Chromosomes/genetics , Genome , Genomics , Molecular Sequence Annotation , Oryzias/genetics , Phylogeny
17.
Nat Commun ; 15(1): 2775, 2024 Mar 30.
Article in English | MEDLINE | ID: mdl-38555371

ABSTRACT

Homologous protein search is one of the most commonly used methods for protein annotation and analysis. Compared to structure search, detecting distant evolutionary relationships from sequences alone remains challenging. Here we propose PLMSearch (Protein Language Model), a homologous protein search method that takes only sequences as input. PLMSearch uses deep representations from a pre-trained protein language model and trains its similarity prediction model on a large number of real structural similarities. This enables PLMSearch to capture the remote homology information concealed behind the sequences. Extensive experimental results show that PLMSearch can search millions of query-target protein pairs in seconds, like MMseqs2, while increasing sensitivity more than threefold, and is comparable to state-of-the-art structure search methods. In particular, unlike traditional sequence search methods, PLMSearch can recall most remote homology pairs with dissimilar sequences but similar structures. PLMSearch is freely available at https://dmiip.sjtu.edu.cn/PLMSearch .


Subjects
Biological Evolution , Proteins , Proteins/chemistry , Molecular Sequence Annotation , Algorithms , Sequence Analysis, Protein
18.
Genome Biol Evol ; 16(3), 2024 Mar 02.
Article in English | MEDLINE | ID: mdl-38491969

ABSTRACT

We present the first chromosome-level genome assembly and annotation of the pearly heath Coenonympha arcania, generated with a PacBio HiFi sequencing approach and complemented with Hi-C data. We additionally compare synteny, gene, and repeat content between C. arcania and other Lepidopteran genomes. This reference genome will enable future population genomics studies with Coenonympha butterflies, a species-rich genus that encompasses some of the most highly endangered butterfly taxa in Europe.


Subjects
Butterflies , Animals , Butterflies/genetics , Genome , Chromosomes/genetics , Synteny , Europe , Molecular Sequence Annotation
19.
Cell Genom ; 4(4): 100527, 2024 Apr 10.
Article in English | MEDLINE | ID: mdl-38537634

ABSTRACT

The seventh iteration of the reference genome assembly for Rattus norvegicus, mRatBN7.2, corrects numerous misplaced segments, reduces base-level errors approximately 9-fold, and increases contiguity 290-fold compared with its predecessor. Gene annotations are now more complete, improving the mapping precision of genomic, transcriptomic, and proteomics datasets. We jointly analyzed 163 short-read whole-genome sequencing datasets representing 120 laboratory rat strains and substrains using mRatBN7.2. We defined ∼20.0 million sequence variations, of which 18,700 are predicted to potentially impact the function of 6,677 genes. We also generated a new rat genetic map from 1,893 heterogeneous stock rats and annotated transcription start sites and alternative polyadenylation sites. The mRatBN7.2 assembly, along with the extensive analysis of genomic variations among rat strains, enhances our understanding of the rat genome, providing researchers with an expanded resource for studies involving rats.


Subjects
Genome , Genomics , Rats , Animals , Genome/genetics , Molecular Sequence Annotation , Whole Genome Sequencing , Genetic Variation/genetics
20.
Genome Biol Evol ; 16(4), 2024 Apr 02.
Article in English | MEDLINE | ID: mdl-38546725

ABSTRACT

Patella caerulea (Linnaeus, 1758) is a limpet species of the mollusc class Gastropoda. Endemic to the Mediterranean Sea, it is considered a keystone species due to its primary role in structuring and regulating the ecological balance of tidal and subtidal habitats. It is currently being used as a bioindicator to assess the environmental quality of coastal marine waters and as a model species to understand adaptation to ocean acidification. Here, we provide a high-quality reference genome assembly and annotation for P. caerulea. We generated ∼30 Gb of Pacific Biosciences high-fidelity data from a single individual and provide a final 749.8 Mb assembly containing 62 contigs, including the mitochondrial genome (14,938 bp). With an N50 of 48.8 Mb and 98% of the assembly contained in the 18 largest contigs, this assembly is near chromosome-scale. Benchmarking Universal Single-Copy Orthologs scores were high (Mollusca, 87.8% complete; Metazoa, 97.2% complete) and similar to metrics observed for other chromosome-level Patella genomes, highlighting a possible bias in the Mollusca database for Patellids. We generated transcriptomic Illumina data from a second individual collected at the same locality and used it together with protein evidence to annotate the genome. A total of 23,938 protein-coding gene models were found. By comparing this annotation with other published Patella annotations, we found that the distributions and median values of exon and gene lengths were comparable with those of other Patella species despite different annotation approaches. The present high-quality P. caerulea reference genome, available on GenBank (BioProject: PRJNA1045377; assembly: GCA_036850965.1), is an important resource for future ecological and evolutionary studies.


Subjects
Gastropoda , Patella , Animals , Hydrogen-Ion Concentration , Molecular Sequence Annotation , Seawater , Mollusca/genetics , Chromosomes , Gastropoda/genetics